sampling bias
Robust Correction of Sampling Bias using Cumulative Distribution Functions
Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariate shift using the empirical cumulative distribution function estimates of the target distribution by a rigorous generalization of a recent idea proposed by Vapnik and Izmailov. Further, we show experimentally that our method is more robust in its predictions, is not reliant on parameter tuning and shows similar classification performance compared to the current state-of-the-art techniques on synthetic and real datasets.
On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation
Zhioua, Sami, Binkyte, Ruta, Ouni, Ayoub, Ktata, Farah Barika
Accurately measuring discrimination is crucial to faithfully assessing fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist and it is assumed that bias resulting from machine learning is born equally by different groups (e.g. females vs males, whites vs blacks, etc.). If, however, bias is born differently by different groups, it may exacerbate discrimination against specific sub-populations. Sampling bias, in particular, is inconsistently used in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). Through an extensive set of experiments on benchmark datasets and using mainstream learning algorithms, we expose relevant observations in several model training scenarios. The observations are finally framed as actionable recommendations for practitioners.
- North America > United States (0.14)
- Africa > Middle East > Tunisia > Sousse Governorate > Sousse (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- (3 more...)
- Information Technology > Security & Privacy (0.45)
- Law (0.34)
- Education (0.34)
Robust Correction of Sampling Bias using Cumulative Distribution Functions
Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariate shift using the empirical cumulative distribution function estimates of the target distribution by a rigorous generalization of a recent idea proposed by Vapnik and Izmailov. Further, we show experimentally that our method is more robust in its predictions, is not reliant on parameter tuning and shows similar classification performance compared to the current state-of-the-art techniques on synthetic and real datasets.
Effect Inference from Two-Group Data with Sampling Bias
Zachariah, Dave, Stoica, Petre
In many applications, different populations are compared using data that are sampled in a biased manner. Under sampling biases, standard methods that estimate the difference between the population means yield unreliable inferences. Here we develop an inference method that is resilient to sampling biases and is able to control the false positive errors under moderate bias levels in contrast to the standard approach. We demonstrate the method using synthetic and real biomarker data.